Processing input longer than model max input token length

Okay, so this is my question:
I have a Mistral model with a 32k max input token length, and I am planning to fine-tune this model for vulnerability detection.
In my dataset, I have inputs longer than 32k tokens.
I want to know how I should feed these functions to the model.
Since the model must see the full code, I can't use truncation, and since this is vulnerability detection and classification, the semantics of the input code are important.
Has anyone encountered this problem?
What do you think is the best method for processing this part of the dataset?


You could try a preprocessing model and trim the excessive inputs to an acceptable length, possibly with summarization. Additionally, you could look into abstract syntax trees. The best approach, I think, would be a sliding context window over those excessive-length inputs, aggregating the predictions across the overlaps; see the sketch below.
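Here is a minimal sketch of the sliding-window idea using a Hugging Face fast tokenizer; the checkpoint name, window size, and stride are placeholders, not recommendations:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any Mistral-family tokenizer works the same way here.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

MAX_LEN = 32_768  # model's max input length (leave headroom for special tokens)
STRIDE = 2_048    # tokens shared between adjacent windows

def chunk_source(code: str):
    """Split one over-length source sample into overlapping token windows."""
    enc = tokenizer(
        code,
        truncation=True,
        max_length=MAX_LEN,
        stride=STRIDE,                   # overlap, so no region is seen only at a window edge
        return_overflowing_tokens=True,  # return every window, not just the first
    )
    # enc["input_ids"] is now a list of windows, each at most MAX_LEN tokens.
    return enc["input_ids"]

# Usage (hypothetical): run the classifier on each window, then aggregate,
# e.g. flag the whole sample if any window is predicted vulnerable.
# windows = chunk_source(long_function_source)
```

Each window then gets its own forward pass, and you aggregate the per-window outputs, for example by max-pooling the vulnerability score so the sample is flagged if any window is.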


So I checked the sliding window, but my models do not support it.
And regarding splitting the input into multiple chunks, I need specific guidance on what to do with the output labels.
For example, let's say I am chunking a vulnerable piece of code, but not all of that code is vulnerable. How do we decide whether a chunk that came from vulnerable code is itself vulnerable or not, even though that specific chunk does not contain the vulnerable line? This is one of my main questions.


You could annotate the dataset with where the functions are vulnerable and maybe provide some context for the function? You could also use a vulnerability CVE dataset, maybe? I'm kind of weak in cybersecurity, but if you have line-level annotations, you can propagate them to the chunks; see the sketch below.
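As a minimal sketch, assuming you have (or can derive) line-level annotations of the vulnerable span, for example the lines touched by the fixing commit in a CVE dataset, a chunk inherits the positive label only if it overlaps an annotated line. All names and chunk sizes below are illustrative:

```python
# Hypothetical label-propagation scheme: a chunk is labeled vulnerable (1)
# only if it overlaps at least one annotated vulnerable line.

def chunk_by_lines(code: str, chunk_lines: int = 400, overlap: int = 50):
    """Split source code into overlapping line-based chunks.

    Yields (start_line, end_line, text) with 1-based inclusive line numbers.
    Assumes non-empty source.
    """
    lines = code.splitlines()
    step = chunk_lines - overlap
    for start in range(0, len(lines), step):
        end = min(start + chunk_lines, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])
        if end == len(lines):
            break

def label_chunks(code: str, vulnerable_lines: set[int]):
    """Build (text, label) rows; label is 1 iff the chunk contains a vulnerable line."""
    rows = []
    for start, end, text in chunk_by_lines(code):
        label = int(any(start <= ln <= end for ln in vulnerable_lines))
        rows.append({"text": text, "label": label})
    return rows

# Usage (hypothetical): the annotation says lines 812-815 are the vulnerable span.
# rows = label_chunks(source_code, vulnerable_lines={812, 813, 814, 815})
```

One caveat: this heuristic still produces negative chunks cut from vulnerable files, so whether that counts as acceptable label noise or needs filtering is a judgment call for your dataset.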
